The data contains 61,069 samples of hypothetical
mushrooms based on 173 species (353 mushrooms per species). Each
observation represents a mushroom described by 21 different attributes.
The problem at hand is a binary classification problem, with
‘class’ as our response variable. The initial dataset of 21
variables is condensed to 15, since some variables conveyed the
same information (for example, ‘cap-color’ and ‘gill-color’ were
essentially the same color), so the redundant variables have been
removed. Rows with missing values were also dropped, leaving 37,065
observations.
Our binary response class is divided into ‘edible’ and
‘poisonous’, with ‘0’ representing edible and ‘1’ poisonous.
The structure of the dataset is given below, where ‘n’ and ‘m’
mark nominal and metrical variables respectively.
1. cap-diameter (m): float number in cm
2. cap-shape (n): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
3. cap-surface (n): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
4. cap-color (n): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
5. does-bruise-bleed (n): bruises-or-bleeding=t, no=f
6. gill-attachment (n): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
7. gill-spacing (n): close=c, distant=d, none=f
8. gill-color (n): see cap-color + none=f
9. stem-height (m): float number in cm
10. stem-width (m): float number in mm
11. stem-root (n): bulbous=b, swollen=s, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r
12. stem-surface (n): see cap-surface + none=f
13. stem-color (n): see cap-color + none=f
14. veil-type (n): partial=p, universal=u
15. veil-color (n): see cap-color + none=f
16. has-ring (n): ring=t, none=f
17. ring-type (n): cobwebby=c, evanescent=e, flaring=r, grooved=g, large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
18. spore-print-color (n): see cap-color
19. habitat (n): grasses=g, leaves=l, meadows=m, paths=p, heaths=h, urban=u, waste=w, woods=d
20. season (n): spring=s, summer=u, autumn=a, winter=w
## Classes 'data.table' and 'data.frame': 37065 obs. of 15 variables:
## $ class : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
## $ cap.diameter : num 1.72 0.86 2.8 1.18 2.46 ...
## $ cap.shape : Factor w/ 7 levels "b","c","f","o",..: 7 7 7 6 1 7 7 7 7 7 ...
## $ cap.surface : Factor w/ 11 levels "d","e","g","h",..: 8 3 8 8 1 6 8 9 9 9 ...
## $ cap.color : Factor w/ 12 levels "b","e","g","k",..: 12 8 11 12 8 11 11 12 6 6 ...
## $ does.bruise.or.bleed: Factor w/ 2 levels "f","t": 1 1 1 1 1 2 2 2 2 1 ...
## $ gill.attachment : Factor w/ 7 levels "a","d","e","f",..: 2 1 2 4 1 2 3 7 5 6 ...
## $ gill.color : Factor w/ 12 levels "b","e","f","g",..: 11 8 11 3 9 11 4 12 12 8 ...
## $ stem.height : num 7.01 4.25 3.13 3.39 5.03 ...
## $ stem.width : num 2.36 1.05 5.69 5.6 6.05 ...
## $ stem.color : Factor w/ 12 levels "e","f","g","k",..: 12 4 11 12 9 11 11 11 6 11 ...
## $ has.ring : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 2 1 2 2 ...
## $ ring.type : Factor w/ 7 levels "e","f","g","l",..: 2 2 2 2 2 2 6 2 5 7 ...
## $ habitat : Factor w/ 7 levels "d","g","h","l",..: 3 1 5 1 2 1 5 1 1 1 ...
## $ season : Factor w/ 4 levels "a","s","u","w": 3 2 1 3 1 3 1 3 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:24004] 2 4 12 13 16 18 21 22 27 31 ...
## ..- attr(*, "names")= chr [1:24004] "2" "4" "12" "13" ...
## - attr(*, ".internal.selfref")=<externalptr>
The distribution of our response class is as follows.
##
## 0 1
## 16944 20121
An inference made from the habitat bar plot is that mushrooms
found in waste areas are mostly edible, though that certainly does not
mean we should eat them! And, contrary to what I expected, most of the
mushrooms found in urban areas seem to be poisonous.
The plot below tells us that most of the samples in our data
belong to mushrooms with convex or sunken cap shapes. It also shows
that a mushroom with either of these two shapes has an almost equal
chance of being edible or poisonous.
Rings are skirt-shaped structures on the stem, usually found just
below the gills of the mushroom. From the plot below, it looks like
mushrooms with no ring are more often unsafe to eat.
Let’s split our data into training and testing sets before building
any models. With an 80/20 split, our training set has 29,653 observations
while the testing set has 7,412. The proportion of edibility in our
training and testing sets is given below.
##
## 0 1
## 13556 16097
##
## 0 1
## 3388 4024
A logistic regression model is fitted first. Below are the test-set
accuracy values (first block) and recall values (second block) for
classification thresholds from 0 to 1 in steps of 0.1; NA appears where
a threshold yields only one predicted class.
## [1] NA
## [1] 0.6377496
## [1] 0.7231516
## [1] 0.7857528
## [1] 0.801538
## [1] 0.7938478
## [1] 0.7817053
## [1] 0.7601187
## [1] 0.7238262
## [1] 0.659606
## [1] NA
## [1] NA
## [1] 0.9880716
## [1] 0.9642147
## [1] 0.9463221
## [1] 0.8774851
## [1] 0.791501
## [1] 0.7236581
## [1] 0.638171
## [1] 0.527833
## [1] 0.3822068
## [1] NA
For a threshold of 0.5, we get the following confusion matrix,
accuracy, and recall values respectively.
## predicted_log
## 0 1
## 0 2699 689
## 1 839 3185
## [1] 0.7938478
## [1] 0.791501
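The accuracy and recall figures above follow directly from the 2x2 confusion matrix printed before them; as a minimal sketch of the computation (this mirrors the accuracy() and recall() helper functions in the code appendix):

```r
# confusion matrix from the threshold-0.5 run above (rows = actual, cols = predicted)
cm <- matrix(c(2699, 839, 689, 3185), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
accuracy <- sum(diag(cm)) / sum(cm)        # (TN + TP) / all predictions
recall   <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN), poisonous = positive
round(c(accuracy = accuracy, recall = recall), 7)  # 0.7938478, 0.7915010
```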
The ROC curve, plotting true positive rate against false
positive rate, is shown below. It bows well above the diagonal of a
random classifier, but it does not follow the ideal path of a perfect
classifier; consistent with the roughly 79% accuracy above, the
logistic regression model performs reasonably but far from perfectly.
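A single-number summary of such a curve is the area under it (AUC), which ROCR can compute from the same kind of prediction object used for the plot. A minimal sketch on toy scores (the data here is illustrative, not from the report; in the report the inputs would be the predicted probabilities and test$class):

```r
library(ROCR)
set.seed(1)
labels <- rep(c(0, 1), each = 50)
scores <- runif(100) / 2 + labels * 0.4   # noisy scores correlated with the label
pred   <- prediction(scores, labels)
auc    <- performance(pred, measure = "auc")@y.values[[1]]
auc    # 1 would mean perfect ranking; 0.5 is random guessing
```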
Stacked histograms to visualize the LDA projection:
There is noticeable
overlap between the histograms, indicating the two classes are not
cleanly separated.
## [1] 0.7862274
## [1] 0.7921352
Confusion Matrix for LDA on testing data:
## Predicted
## Actual 0 1
## 0 2643 745
## 1 816 3208
## [1] 0.7893956
## [1] 0.7972167
## [1] 0.2106044
The decision tree classifier is a straightforward approach for a
binomial classification problem. In fact, since most of the predictors
are categorical, the resulting tree can be interpreted without any
prior knowledge of tree algorithms.
The results below are for the default value of the complexity
parameter cp in the ‘rpart’ library, which is 0.01.
## test_tree
## 0 1
## 0 2970 418
## 1 138 3886
## [1] 0.9249865
## [1] 0.9657058
With cp=0, the tree is grown out fully:
clear signs of overfitting, with extremely high accuracy levels.
Accuracy and recall for the training and testing data respectively.
##
## train_basetree 0 1
## 0 13498 21
## 1 58 16076
## [1] 0.9973359
## [1] 0.9964051
## test_basetree
## 0 1
## 0 3378 10
## 1 15 4009
## [1] 0.9966271
## [1] 0.9962724
## [1] 0.003372909
Pruned Tree:
##
## train_basetree 0 1
## 0 13369 179
## 1 187 15918
## [1] 0.9876572
## [1] 0.9883887
## test_basetree
## 0 1
## 0 3348 40
## 1 45 3979
## [1] 0.9885321
## [1] 0.9888171
## [1] 0.01146789
Although the pruned tree still reaches about 98.9% test accuracy,
there should be models performing better than this; it still makes
some errors. It can be enhanced by using random forests.
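The pruning above uses a fixed cp of 0.001; a common alternative is to read the best cp off the tree's cross-validation table. A minimal sketch on toy data (in the report, tree_base is instead the fully grown tree on the mushroom training set):

```r
library(rpart)
set.seed(1)
toy <- data.frame(y  = factor(sample(0:1, 400, replace = TRUE)),
                  x1 = rnorm(400), x2 = rnorm(400))
tree_base <- rpart(y ~ ., data = toy, method = "class",
                   control = rpart.control(cp = 0))  # fully grown tree
# pick the cp with the lowest cross-validated error, then prune back to it
best_cp     <- tree_base$cptable[which.min(tree_base$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_base, cp = best_cp)
```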
##
## Call:
## randomForest(formula = class ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.01%
## Confusion matrix:
## 0 1 class.error
## 0 13556 0 0.0000000000
## 1 4 16093 0.0002484935
By default, R picks 500 trees, but anything beyond 200 trees
doesn’t seem to reduce the error further.
## [1] 0.9997302
## [1] 0.999503
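The claim that the error flattens after about 200 trees can be checked numerically on the fitted forest's err.rate matrix, which records the out-of-bag error after each additional tree. A minimal sketch on toy data (in the report, mush_rf plays the role of rf here):

```r
library(randomForest)
set.seed(1)
toy <- data.frame(y  = factor(rep(0:1, each = 100)),
                  x1 = rnorm(200), x2 = rnorm(200))
rf  <- randomForest(y ~ ., data = toy, ntree = 500)
oob <- rf$err.rate[, "OOB"]  # cumulative OOB error rate per tree count
plot(oob, type = "l", xlab = "number of trees", ylab = "OOB error rate")
```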
##
## Call:
## randomForest(formula = class ~ ., data = train, mtry = 3, ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.02%
## Confusion matrix:
## 0 1 class.error
## 0 13555 1 7.376807e-05
## 1 5 16092 3.106169e-04
## [1] 0.9997302
## [1] 0.999503
I’m curious to see what the major attributes are in classifying
mushrooms.
The two plots below compare variable importance for the initial
random forest vs. the tuned random forest.
The model does well on the training data, with an out-of-bag
error rate of about 0.01%. Having near-perfect prediction on the
testing data would be ideal as well, so let’s check how the model
performs on the testing data.
The accuracy and recall for the testing data are given below.
Our model is nearly 100% accurate (99.97%), which is exactly what
we’re looking for.
## [1] 0.9997302
## [1] 0.999503
Running the decision tree algorithm on just the above important
predictors still seems to fetch decent results.
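The "important predictors" behind this reduced model can also be extracted programmatically with importance() rather than read off the plots; a minimal sketch on toy data (in the report, the rankings come from the varImpPlot() calls on the fitted forests):

```r
library(randomForest)
set.seed(1)
toy <- data.frame(y  = factor(rep(0:1, each = 100)),
                  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
rf  <- randomForest(y ~ ., data = toy, importance = TRUE)
imp <- importance(rf, type = 1)               # type 1 = mean decrease in accuracy
rownames(imp)[order(imp, decreasing = TRUE)]  # predictors, most important first
```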
##
## 0 1
## 13556 16097
##
## 0 1
## 3388 4024
## [1] 0.8688613
## [1] 0.8565421
##
## Call:
## randomForest(formula = class ~ ., data = train1, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.76%
## Confusion matrix:
## 0 1 class.error
## 0 13416 140 0.010327530
## 1 85 16012 0.005280487
## [1] 0.9924447
## [1] 0.9925521
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
library(data.table)
#secondary_data <- read.csv("~/Desktop/FIN_DATA_MINING/data mining data/secondary_data.csv", sep=";")
#mushrooms=secondary_data[,c(-8,-9,-13,-14,-16,-19)]
#str(mushrooms)
#setDT(mushrooms)
#head(mushrooms,5)
secondary_data_shuffled <- read.csv("~/Downloads/files/SecondaryData/secondary_data_shuffled.csv", sep=";")
mushrooms1=copy(secondary_data_shuffled)
mushrooms1=mushrooms1[,c(-8,-12,-13,-15,-16,-19)]
# treat empty strings in any remaining column as missing, then drop incomplete rows
for (col in names(mushrooms1)) {
  mushrooms1[[col]][mushrooms1[[col]] == ""] <- NA
}
mushrooms1=na.omit(mushrooms1)
mushrooms=copy(mushrooms1)
setDT(mushrooms)
rm(mushrooms1)
library(ggplot2)
library(tidyverse)
setDT(mushrooms)
#str(mushrooms)
# encode the response: edible ('e') -> 0, poisonous ('p') -> 1
mushrooms[,class:=factor(class, levels = c('e','p'),labels = c(0,1))]
# convert the remaining categorical columns to factors
cat_cols=c("cap.shape","cap.surface","cap.color","does.bruise.or.bleed",
           "gill.attachment","gill.color","stem.color","has.ring",
           "ring.type","habitat","season")
mushrooms[,(cat_cols):=lapply(.SD,factor),.SDcols=cat_cols]
str(mushrooms)
table(mushrooms$class)
ggplot(mushrooms,aes(x=habitat,fill=class))+geom_bar(position='dodge')+scale_x_discrete('Habitat', labels=c("woods","grasses","heaths","leaves","meadows","urban","waste"))+scale_y_continuous('Count')+scale_fill_manual(breaks = c(0, 1),values = c("green3", "red2"))
ggplot(mushrooms,aes(y=cap.shape,fill=class))+geom_bar(position='dodge')+scale_y_discrete('Cap-Shape', labels=c("bell","conical","flat","others","spherical","sunken","convex"))+scale_x_continuous('Count')+scale_fill_manual(breaks = c(0, 1),values = c("green3", "red2"))
ggplot(mushrooms,aes(x=cap.shape,fill=class))+geom_bar(position='stack')+scale_x_discrete('Cap-Shape')+scale_y_continuous('Count')+facet_wrap(~has.ring)+scale_fill_manual(breaks = c(0, 1),values = c("green3", "red2"))
ggplot(mushrooms,aes(x=season,y=cap.shape,col=class))+geom_jitter(alpha = 0.5)+scale_color_manual(breaks = c(0, 1),values = c("green3", "red2"))#+scale_color_manual(breaks=c("edible", "poisonous"),values = c("blue", "red"))
ggplot(mushrooms,aes(x=gill.attachment,y=cap.shape,col=class))+geom_jitter(alpha = 0.5)+scale_color_manual(breaks = c(0, 1),values = c("green3", "red2"))
ggplot(mushrooms,aes(x=ring.type,y=cap.shape,col=class))+geom_jitter(alpha = 0.5)+scale_color_manual(breaks = c(0, 1),values = c("green3", "red2"))
library(caret)
library(ROCR)
set.seed(111)
trainindex=createDataPartition(mushrooms$class,p=0.8,list=FALSE,times=1)
#head(trainindex)
train=mushrooms[trainindex,]
test=mushrooms[-trainindex,]
rm(trainindex)
table(train$class)
table(test$class)
log_model=glm(class~.,data=train,family='binomial')
okok=seq(0,1,0.1)
accuracy=function(x){
summa=(x[1]+x[4])/(x[2]+x[3]+x[1]+x[4])
return(summa)
}
recall=function(x){
summa=(x[4])/(x[2]+x[4])
return(summa)
}
for(i in okok){
predicted_log=predict(log_model,test,type="response")
predicted_log=ifelse(predicted_log>i,1,0)
a=table(test$class,predicted_log)
print(accuracy(a))
}
#accuracy.list=c(0.613936,0.6743634,0.7160403,0.7559977,0.7816261,0.7713911,0.727176,0.6864816,0.615246)
accuracy.list=c(0.5875607,0.6567728,0.7231516,0.7481112,0.7394765,0.7054776,0.6868591,0.6454398,0.5769023)
for(i in okok){
predicted_log=predict(log_model,test,type="response")
predicted_log=ifelse(predicted_log>i,1,0)
a=table(test$class,predicted_log)
print(recall(a))
}
#recall.list=c(0.9825882,0.9518961,0.9266637,0.8847573,0.8075845,0.7184595,0.5869854,0.464955,0.3097241)
recall.list=c(0.987326,0.9788767,0.9537773,0.889662,0.7691352,0.6336978,0.5072068,0.3727634,0.2266402)
okok=okok[-1]
okok=okok[-10]
log_df=data.table(okok,accuracy.list,recall.list)
pqr=ggplot(log_df,aes(x=okok,y=accuracy.list))+geom_line(col='blue',size=1)
pqr=pqr+geom_line(aes(x=okok,y=recall.list),size=1)
pqr
predicted_log=predict(log_model,test,type="response")
predicted_log=ifelse(predicted_log>0.5,1,0)
a=table(test$class,predicted_log)
a
accuracy(a)
recall(a)
#prediction_lgr_df2 <- predict(model_lgr_df2, data = df2, type="response")
pr_admission <- prediction(predicted_log, test$class)
prf_admission <- performance(pr_admission, measure = "tpr", x.measure = "fpr")
plot(prf_admission, colorize = TRUE, lwd=3)
library(MASS)
linear=lda(class~.,train)
p=predict(linear,train)
ldahist(data=p$x[,1],g=train$class)
train_predict=predict(linear,train)
gg=table(Actual=train$class,Predicted=train_predict$class)
accuracy(gg)
recall(gg)
test_pred=predict(linear,test)
bb=table(Actual=test$class,Predicted=test_pred$class)
bb
accuracy(bb)
recall(bb)
misclas=function(x){
summa=(x[2]+x[3])/(x[1]+x[3]+x[2]+x[4])
return(summa)
}
misclas(bb)
#caret::confusionMatrix(data = predict(tree_1, newdata = test, type = "class"),
# reference = test$class,
# positive = "edible")
library(rpart)
library(rpart.plot)
set.seed(789)
tree_1=rpart(class~.,data=train,method="class")
rpart.plot(tree_1)
test_tree=predict(tree_1,newdata = test,type="class")
b=table(test$class,test_tree)
b
accuracy(b)
recall(b)
#train_tree=predict(tree_1,newdata = train,type="class")
#table(train_tree,train$class)
#table(test_tree)
#levels(train$class) <- c("edible", "poisonous")
#levels(test$class) <- c("edible", "poisonous")
set.seed(143)
tree_base <- rpart(class~., data = train, method = "class",control = rpart.control(cp = 0))
#summary(tree_base)
#summary(tree_1)
rpart.plot(tree_base)
plotcp(tree_base,col='red',lwd=1)
train_basetree=predict(tree_base,newdata = train,type="class")
c=table(train_basetree,train$class)
c
accuracy(c)
recall(c)
test_basetree=predict(tree_base,newdata = test,type="class")
b=table(test$class,test_basetree)
b
accuracy(b)
recall(b)
misclas(b)
bestcp <- tree_base$cptable[which.min(tree_base$cptable[, "xerror"]), "CP"]
tree_pruned=prune(tree_base,cp=0.001)
rpart.plot(tree_pruned)
train_basetree=predict(tree_pruned,newdata = train,type="class")
d=table(train_basetree,train$class)
d
accuracy(d)
recall(d)
test_basetree=predict(tree_pruned,newdata = test,type="class")
e=table(test$class,test_basetree)
e
accuracy(e)
recall(e)
#rpart.plot(tree_1)
#tree_1=rpart(class~.,data=train,method="class")
misclas(e)
set.seed(899)
library(randomForest)
mush_rf=randomForest(class~.,data=train,importance=TRUE)
mush_rf
plot(mush_rf)
rf_base_pred=predict(mush_rf,newdata = test)
base=table(test$class,rf_base_pred)
accuracy(base)
recall(base)
set.seed(8899)
tree_final=randomForest(class~.,mtry=3,ntree=200,data=train,importance=TRUE)
tree_final
rf_pred=predict(tree_final,newdata = test)
xyz=table(test$class,rf_pred)
accuracy(xyz)
recall(xyz)
varImpPlot(mush_rf, sort = TRUE,
n.var = 10, main = "Importance of predictors")
varImpPlot(tree_final, sort = TRUE,
n.var = 10, main = "Importance of predictors")
rf_res=predict(tree_final,newdata=test)
pp=table(test$class,rf_res)
accuracy(pp)
recall(pp)
mush_new=mushrooms[,c(1,4,5,7,9,10)]
set.seed(3322)
trainindex=createDataPartition(mushrooms$class,p=0.8,list=FALSE,times=1)
#head(trainindex)
train1=mush_new[trainindex,]
test1=mush_new[-trainindex,]
rm(trainindex)
table(train1$class)
table(test1$class)
set.seed(555555)
tree_random=rpart(class~.,data=train1,method="class")
rpart.plot(tree_random)
test_tree=predict(tree_random,newdata = test1,type="class")
g=table(test_tree,test1$class)
accuracy(g)
recall(g)
set.seed(789789)
new_rf=randomForest(class~.,data=train1,importance=TRUE)
new_rf
test_tree=predict(new_rf,newdata = test1)
g=table(test_tree,test1$class)
accuracy(g)
recall(g)